Introduction

This Rmarkdown script (and corresponding TAB-separated CSV input data file InfoRateData.csv and the resulting HTML document) contain the full analysis and plotting code accompanying the paper Human languages share an optimal information transmission rate.

The data

For more information on the data, please see Oh (2015). There are in total 17 languages (see the Table below).

The Oral Corpus

The oral corpus is based on a subset of the Multext (Multilingual Text Tools and Corpora) parallel corpus (Campione & Véronis, 1998) in British English, German, and Italian. The material consists of 15 short texts of 3-5 semantically connected sentences carefully translated by a native speaker in each language.

For the other 14 languages, two of the authors supervised the translation and recording of new datasets. All participants were native speakers of the target language, with a focus on a specific variety of the language when possible – e.g. Mandarin spoken in Beijing, Serbian in Belgrade and Korean in Seoul. No strict control on age or on the speakers’ social diversity was performed, but speakers were mainly students or members of academic institutions. Speakers were asked to read three times (first silently and then loudly twice) each text. The texts were presented one by one on the screen in random order, in a self-paced reading paradigm. This way, speakers familiarized themselves with the text and reduce their reading errors. The second loud recording was analyzed in this study.

Text Corpus

Text datasets were acquired from various sources as illustrated in the Table below. After an initial data curation, each dataset was phonetically transcribed and automatically syllabified by a rule-based program written by one of the authors, except in the following cases:

  1. when syllabification was already provided with the dataset (English, French, German, and Vietnamese for the multisyllabic words);
  2. when the corpus was syllabified by an automatic grapheme-to-phoneme converter (Catalan, Spanish, and Thai).

Additionally, no syllabification was required for Sino-Tibetan languages (Cantonese and Mandarin Chinese) since one ideogram corresponds to one syllable.

Language Family ISO 639-3 Corpus
Basque Basque EUS E-Hitz (Perea et al., 2006)
British English Indo-European ENG WebCelex (MPI for Psycholinguistics)
Cantonese Sino-Tibetan YUE A linguistic corpus of mid-20th century Hong Kong Cantonese
Catalan Indo-European CAT Frequency dictionary (Zséder et al., 2012)
Finnish Uralic FIN Finnish Parole Corpus
French Indo-European FRA Lexique 3.80 (New et al., 2001)
German Indo-European DEU WebCelex (MPI for Psycholinguistics)
Hungarian Uralic HUN Hungarian National Corpus (Váradi, 2002)
Italian Indo-European ITA The Corpus PAISÀ (Lyding et al., 2014)
Japanese Japanese JPN Japanese Internet Corpus (Sharoff, 2006)
Korean Korean KOR Leipzig Corpora Collection (LCC)
Mandarin Chinese Sino-Tibetan CMN Chinese Internet Corpus (Sharoff, 2006)
Serbian Indo-European SRP Frequency dictionary (Zséder et al., 2012)
Spanish Indo-European SPA Frequency dictionary (Zséder et al., 2012)
Thai Tai-Kadai THA Thai National Corpus (TNC)
Turkish Turkic TUR Leipzig Corpora Collection (LCC)
Vietnamese Austroasiatic VIE VNSpeechCorpus (Le et al., 2004)

Dataset structure

The data is structured as follows:

  • Language is the unique language ISO 639-3 ID with 17 possible values “CAT”, “CMN”, “DEU”, “ENG”, “EUS”, “FIN”, “FRA”, “HUN”, “ITA”, “JPN”, “KOR”, “SPA”, “SRP”, “THA”, “TUR”, “VIE”, “YUE”;
  • Language family is the language family with 9 possible values “Austroasiatic”, “Basque”, “Indo-European”, “Japanese”, “Korean”, “Sino-Tibetan”, “Tai-Kadai”, “Turkic”, “Uralic”;
  • Text is the text identifier (similar across languages and speakers) with 15 possible values “O1”, “O2”, “O3”, “O4”, “O6”, “O8”, “O9”, “P0”, “P1”, “P2”, “P3”, “P8”, “P9”, “Q0”, “Q1”;
  • Speaker is the unique speaker’s ID (there are 170 in the dataset);
  • Sex is the speaker’s sex;
  • Age is the speaker’s age (available only for a subset of 132 speakers);
  • Duration is the texts’ duration in seconds as spoken by the given speaker (pauses longer than 150ms being excluded);
  • NS is the texts’ number of of syllables (the same for a given text in a given language);
  • SR is the speech rate (syllables/second) for a given text spoken by a given speaker;
  • ShE (resp. ID) is the first-order (resp. second-order) language-level entropy estimate;
  • ShIR and IR are text * speaker properties obtained by multiplying SR by ShE and ID respectively.

Exploratory plots and summaries

Speaker characteristics

Distribution of speaker number and characteristics (sex and age) by language. Lng=language, # spkrs=number of speakers, % fem=percent female speakers, # age=number of speakers with age info, and for those, mean(age)=mean age, sd(age)=standard deviation of age, and actual ages=the sorted ages.
Lng # spkrs % fem # age mean(age) sd(age) actual ages
CAT 10 50 10 35.4 9.2 (21, 28, 28, 29, 31, 39, 42, 42, 44, 50)
CMN 10 50 9 23.1 4.5 (19, 19, 19, 19, 24, 24, 25, 28, 31)
DEU 10 50 0 NaN NaN ()
ENG 10 50 0 NaN NaN ()
EUS 10 50 10 28.0 4.9 (19, 22, 26, 27, 28, 29, 30, 31, 32, 36)
FIN 10 50 10 33.2 11.0 (16, 22, 26, 28, 30, 35, 37, 41, 45, 52)
FRA 10 50 10 32.5 7.7 (24, 25, 25, 27, 28, 36, 36, 37, 41, 46)
HUN 10 50 10 39.3 15.8 (17, 27, 27, 31, 33, 39, 42, 51, 57, 69)
ITA 10 50 0 NaN NaN ()
JPN 10 50 10 30.6 12.8 (20, 20, 21, 22, 22, 28, 29, 40, 51, 53)
KOR 10 50 10 28.6 10.6 (16, 19, 19, 19, 28, 31, 33, 35, 36, 50)
SPA 10 50 10 33.7 10.1 (21, 22, 26, 28, 30, 32, 42, 42, 44, 50)
SRP 10 50 10 30.6 7.8 (19, 21, 23, 30, 31, 32, 34, 34, 38, 44)
THA 10 50 10 30.1 5.7 (23, 23, 27, 28, 30, 31, 31, 32, 33, 43)
TUR 10 50 7 32.6 7.2 (24, 25, 30, 31, 37, 37, 44)
VIE 10 50 6 27.2 4.1 (21, 25, 26, 28, 31, 32)
YUE 10 50 10 22.0 1.5 (20, 20, 21, 21, 22, 22, 23, 23, 24, 24)

NS

NS: exploratory plots.

NS: exploratory plots.

mean=101.217, median=100, sd=24.694, CV=0.244, min=49, max=162, kurtosis=2.485, skewness=0.178.

SR

SR: exploratory plots.

SR: exploratory plots.

SR per speaker.

SR per speaker.

SR by Sex and Age across Languages.

SR by Sex and Age across Languages.

SR by Sex, Age and Language.

SR by Sex, Age and Language.

SR by language.

SR by language.

mean=6.612, median=6.76, sd=1.133, CV=0.171, min=3.59, max=9.38, kurtosis=2.382, skewness=-0.197.

ShE and ID

ShE and ID: exploratory plots.

ShE and ID: exploratory plots.

ShE vs ID.

ShE vs ID.


    Pearson's product-moment correlation

data:  tmp1$ShE and tmp1$ID
t = 2.0326, df = 15, p-value = 0.06019
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02052914  0.77274779
sample estimates:
      cor 
0.4647009 

    Spearman's rank correlation rho

data:  tmp1$ShE and tmp1$ID
S = 451.88, p-value = 0.07259
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4462208 

    Paired t-test

data:  tmp1$ShE and tmp1$ID
t = 11.635, df = 16, p-value = 3.213e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.158040 3.119607
sample estimates:
mean of the differences 
               2.638824 

ShE:

mean=8.619, median=8.69, sd=0.906, CV=0.105, min=6.07, max=9.83, kurtosis=4.662, skewness=-1.124.

ID:

mean=6.009, median=5.56, sd=0.883, CV=0.147, min=4.83, max=8.02, kurtosis=2.521, skewness=0.741.

ShIR and IR

ShIR: exploratory plots.

ShIR: exploratory plots.

ShIR per speaker.

ShIR per speaker.

ShIR by Sex and Age across Languages.

ShIR by Sex and Age across Languages.

ShIR by Sex, Age and Language.

ShIR by Sex, Age and Language.

ShIR by language.

ShIR by language.

IR: exploratory plots.

IR: exploratory plots.

IR per speaker.

IR per speaker.

IR by Sex and Age across Languages.

IR by Sex and Age across Languages.

IR by Sex, Age and Language.

IR by Sex, Age and Language.

IR by language.

IR by language.

ShIR:

mean=56.525, median=56.92, sd=9.217, CV=0.163, min=32.77, max=83.04, kurtosis=2.335, skewness=0.048.

IR:

mean=39.033, median=39.05, sd=4.945, CV=0.127, min=25.64, max=56.3, kurtosis=3.422, skewness=0.241.

SR and ID

SR vs ID

SR vs ID


    Pearson's product-moment correlation

data:  info.rate.data$SR and info.rate.data$ID
t = -46.635, df = 2263, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7204624 -0.6784227
sample estimates:
       cor 
-0.7000486 

    Spearman's rank correlation rho

data:  info.rate.data$SR and info.rate.data$ID
S = 3315700000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.7120735 

Linear Mixed Modells (LMMs)

Linear ICCs

ICC for model NS ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.30
Language 0.54
Speaker 0.00
ICC for model SR ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Language 0.62
Speaker 0.29
ICC for model ShIR ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Language 0.56
Speaker 0.34
ICC for model IR ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.02
Language 0.36
Speaker 0.49

SR

Comparing models

AIC and BIC for various hierarchical models of SR (in bold are the minimum values).
model AIC BIC
1 + (1 | Text) + (1 | Language) + (1 | Speaker) 1998.03 2026.65
1 + (1 | Language) + (1 | Speaker) 2269.3 2292.2
1 + (1 | Text) + (1 | Speaker) 2131.96 2154.86
1 + (1 | Text) + (1 | Language) 4490.99 4513.89
1 + Sex + (1 | Text) + (1 | Language) + (1 | Speaker) 1990.57 2024.92
1 + Sex + (1 | Language) + (1 | Speaker) 2261.69 2290.31
1 + Sex + (1 | Text) + (1 | Speaker) 2131.72 2160.35
1 + Sex + (1 | Text) + (1 | Language) 4364.38 4393.01

Residuals and random effects

We consider here the full model SR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).

IR

Comparing models

AIC and BIC for various hierarchical models of IR (in bold are the minimum values).
model AIC BIC
1 + (1 | Text) + (1 | Language) + (1 | Speaker) 10041.03 10069.65
1 + (1 | Language) + (1 | Speaker) 10307.63 10330.53
1 + (1 | Text) + (1 | Speaker) 10093.09 10115.99
1 + (1 | Text) + (1 | Language) 12574.43 12597.33
1 + Sex + (1 | Text) + (1 | Language) + (1 | Speaker) 10029.53 10063.88
1 + Sex + (1 | Language) + (1 | Speaker) 10295.98 10324.6
1 + Sex + (1 | Text) + (1 | Speaker) 10086.35 10114.97
1 + Sex + (1 | Text) + (1 | Language) 12440.04 12468.67

Residuals and random effects

We consider here the full model IR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).

Generalized Additive Models for Location, Scale and Shape (GAMLSS)

We will use a Gaussian distribution (with fixed or modelled variance).

SR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  6.666774e-05 
                       variance   =  1.000442 
               coef. of skewness  =  -0.0552563 
               coef. of kurtosis  =  3.323173 
Filliben correlation coefficient  =  0.9990793 
******************************************************************


Deviance= 1048.516 

AIC= 1447.603 
Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001738762 
                       variance   =  1.000438 
               coef. of skewness  =  -0.02112376 
               coef. of kurtosis  =  2.757837 
Filliben correlation coefficient  =  0.9991447 
******************************************************************


Deviance= 719.0835 

AIC= 1299.257 

The distribution of the residuals is less heteroscedastic than before and the fit to the data better. The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex + random(Text) + random(Language) +      random(Speaker), sigma.formula = ~1 + Sex + random(Text) +      random(Language) + random(Speaker), family = NO(mu.link = "identity"),  
    data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.453290   0.007618  847.14   <2e-16 ***
SexM        0.320159   0.011487   27.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.30948    0.02098 -62.417  < 2e-16 ***
SexM         0.11194    0.02972   3.767  0.00017 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2265 
Degrees of Freedom for the fit:  290.0869
      Residual Deg. of Freedom:  1974.913 
                      at cycle:  48 
 
Global Deviance:     719.0835 
            AIC:     1299.257 
            SBC:     2960.101 
******************************************************************

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.3977 
Random effect parameter sigma_b: 0.108314 
Smoothing parameter lambda     : 85.7824 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.9836 
Random effect parameter sigma_b: 0.87066 
Smoothing parameter lambda     : 1.32915 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.2778 
Random effect parameter sigma_b: 0.555308 
Smoothing parameter lambda     : 3.49815 
σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.583681 
Random effect parameter sigma_b: 0.0108169 
Smoothing parameter lambda     : 7523.57 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.91065 
Random effect parameter sigma_b: 0.161228 
Smoothing parameter lambda     : 34.0803 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 73.93344 
Random effect parameter sigma_b: 0.165342 
Smoothing parameter lambda     : 33.2784 

IR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.0002550467 
                       variance   =  1.000442 
               coef. of skewness  =  0.01302313 
               coef. of kurtosis  =  3.360878 
Filliben correlation coefficient  =  0.9990567 
******************************************************************


Deviance= 9108.279 

AIC= 9507.442 
Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001906696 
                       variance   =  1.000449 
               coef. of skewness  =  -0.02187398 
               coef. of kurtosis  =  2.755234 
Filliben correlation coefficient  =  0.9991259 
******************************************************************


Deviance= 8782.878 

AIC= 9364.641 

Again, this is a better fit to the data. The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex + random(Text) + random(Language) +      random(Speaker), sigma.formula = ~1 + Sex + random(Text) +      random(Language) + random(Speaker), family = NO(mu.link = "identity"),  
    data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.08916    0.04521  842.56   <2e-16 ***
SexM         1.89529    0.06851   27.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47207    0.02098  22.501  < 2e-16 ***
SexM         0.11385    0.02972   3.831 0.000131 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2265 
Degrees of Freedom for the fit:  290.8814
      Residual Deg. of Freedom:  1974.119 
                      at cycle:  7 
 
Global Deviance:     8782.878 
            AIC:     9364.641 
            SBC:     11030.03 
******************************************************************

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.42046 
Random effect parameter sigma_b: 0.658373 
Smoothing parameter lambda     : 2.32185 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.9518 
Random effect parameter sigma_b: 3.06988 
Smoothing parameter lambda     : 0.106913 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.1517 
Random effect parameter sigma_b: 3.31521 
Smoothing parameter lambda     : 0.0981439 
σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.09852989 
Random effect parameter sigma_b: 0.00428681 
Smoothing parameter lambda     : 47821.8 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.96383 
Random effect parameter sigma_b: 0.163535 
Smoothing parameter lambda     : 33.0774 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 75.29513 
Random effect parameter sigma_b: 0.168066 
Smoothing parameter lambda     : 32.181 

Modelling the relationship between SR and ID

Let’s model SR with ID as an additional predictor (fixed effect). N.B. In this case, we must drop Language as a random effect, since each language has, by definition, only one value of ID.

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID + Sex + random(Text) +      random(Speaker), sigma.formula = ~1 + ID + Sex +      random(Text) + random(Speaker), family = NO(mu.link = "identity"),      data = d, control = gamlss.control(n.cyc = 800,  
        trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.794174   0.036777  320.69   <2e-16 ***
ID          -0.888238   0.005842 -152.06   <2e-16 ***
SexM         0.310995   0.011382   27.32   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.79946    0.10475  -7.632 3.57e-14 ***
ID          -0.08642    0.01705  -5.070 4.35e-07 ***
SexM         0.10951    0.02972   3.685 0.000235 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2265 
Degrees of Freedom for the fit:  293.322
      Residual Deg. of Freedom:  1971.678 
                      at cycle:  18 
 
Global Deviance:     686.5481 
            AIC:     1273.192 
            SBC:     2952.558 
******************************************************************

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001881969 
                       variance   =  1.000438 
               coef. of skewness  =  -0.02304788 
               coef. of kurtosis  =  2.677419 
Filliben correlation coefficient  =  0.9989511 
******************************************************************


Deviance= 686.5481 

AIC= 1273.192 

Adding ID as a predictor improves the fits (as judged by AIC). There is a negative estimate for ID, but significance is difficult to assess with GAMLSS model involving smoothing functions. However, also using a simple lmer model we have a significant effect of ID:

Type III Analysis of Variance Table with Satterthwaite's method
     Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
ID  16.4136 16.4136     1 166.55 162.5728 < 2.2e-16 ***
Sex  0.8047  0.8047     1 166.63   7.9707  0.005334 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.42354 
Random effect parameter sigma_b: 0.10986 
Smoothing parameter lambda     : 83.3869 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 167.487 
Random effect parameter sigma_b: 0.74601 
Smoothing parameter lambda     : 1.94032 

σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 1.916828 
Random effect parameter sigma_b: 0.0201916 
Smoothing parameter lambda     : 2061.61 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 103.4947 
Random effect parameter sigma_b: 0.234219 
Smoothing parameter lambda     : 16.0415 

Do Age and Sex matter?

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ Age * Sex + (1 | Text) + (1 | Language)
   Data: info.rate.data

REML criterion at convergence: 3541.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8993 -0.6397  0.0095  0.6216  3.2123 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.01201  0.1096  
 Language (Intercept) 1.01269  1.0063  
 Residual             0.33375  0.5777  
Number of obs: 1959, groups:  Text, 15; Language, 14

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  6.663e+00  2.791e-01  1.499e+01  23.876 2.43e-13 ***
Age         -6.219e-03  2.177e-03  1.930e+03  -2.857  0.00432 ** 
SexM         1.808e-01  8.874e-02  1.928e+03   2.037  0.04174 *  
Age:SexM     3.058e-03  2.770e-03  1.928e+03   1.104  0.26986    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Age    SexM  
Age      -0.237              
SexM     -0.176  0.682       
Age:SexM  0.170 -0.716 -0.956
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: IR ~ Age * Sex + (1 | Text) + (1 | Language)
   Data: info.rate.data

REML criterion at convergence: 10507.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8289 -0.6365  0.0016  0.6136  3.5323 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept)  0.3901  0.6246  
 Language (Intercept) 10.1922  3.1925  
 Residual             11.8777  3.4464  
Number of obs: 1959, groups:  Text, 15; Language, 14

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)   39.33279    0.96043   20.45736  40.953  < 2e-16 ***
Age           -0.04214    0.01298 1933.79776  -3.247  0.00119 ** 
SexM           1.08599    0.52934 1929.48938   2.052  0.04034 *  
Age:SexM       0.01915    0.01653 1929.60714   1.159  0.24671    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Age    SexM  
Age      -0.411              
SexM     -0.305  0.682       
Age:SexM  0.294 -0.716 -0.956

So, it seems * Age* and Sex are both worth including in our models (even if we have to discard quite a bit of data because of missing Age info). (In fact, the effect of Age seems more significant than that of Sex.)

In the following, we investigate if Age does matter when using GAMLSS modelling…

GAMLSS with Age

Bescause there is missing data fro Age, and because the GAMLSS models require no missing data, we will fit the models with Age (and its interaction with Sex) on the subset of the data that contains only those speakers with Age info. To make comparability possible, we also fit the same models but without Age on the exact same subset of the data.

SR

Fixed or modelled σ (variance)
Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  -6.761517e-05 
                       variance   =  1.000511 
               coef. of skewness  =  -0.07716915 
               coef. of kurtosis  =  3.390812 
Filliben correlation coefficient  =  0.998745 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Speaker), family = NO(mu.link = "identity"),      data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,  
        trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.611763   0.033552 197.060  < 2e-16 ***
SexM         0.165405   0.045837   3.609 0.000316 ***
Age         -0.003179   0.001050  -3.029 0.002487 ** 
SexM:Age     0.003528   0.001428   2.470 0.013601 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.18825    0.01598  -74.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1959 
Degrees of Freedom for the fit:  161.6144
      Residual Deg. of Freedom:  1797.386 
                      at cycle:  33 
 
Global Deviance:     903.8325 
            AIC:     1227.061 
            SBC:     2128.901 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 903.8 1227.1
Age + Sex 903.8 1225.1
Sex 903.8 1223.1

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted here…

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.0022763 
                       variance   =  1.000505 
               coef. of skewness  =  -0.03406773 
               coef. of kurtosis  =  2.772961 
Filliben correlation coefficient  =  0.9990489 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Speaker), sigma.formula = ~1 +      Sex * Age + random(Text) + random(Language) + random(Speaker),  
    family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),      i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.6375862  0.0269709 246.101  < 2e-16 ***
SexM         0.2126714  0.0400028   5.316 1.20e-07 ***
Age         -0.0040916  0.0008478  -4.826 1.52e-06 ***
SexM:Age     0.0019632  0.0012742   1.541    0.124    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.381266   0.080544 -17.149   <2e-16 ***
SexM         0.039702   0.108394   0.366    0.714    
Age          0.002071   0.002525   0.820    0.412    
SexM:Age     0.002463   0.003381   0.728    0.466    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1959 
Degrees of Freedom for the fit:  242.0172
      Residual Deg. of Freedom:  1716.983 
                      at cycle:  93 
 
Global Deviance:     626.8351 
            AIC:     1110.87 
            SBC:     2461.371 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 626.8 1110.9
Age + Sex 627.3 1107.1
Sex 626.8 1104

So, even if Age has a significant (negative) effect (but no interaction with Sex), adding it does not seem to be warranted here either…

The distribution of the residuals is less heteroscedastic than before and the fit to the data better.

Summary

Thus, for SR, even if there is a hint that Age might affect it negatively (and there might also be an interaction with Sex with a positive effect for males), overall, the various fit indices do not warrant its inclusion in the GAMLSS models.

IR

Fixed or modelled σ (variance)
Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.000175519 
                       variance   =  1.000511 
               coef. of skewness  =  -0.01914536 
               coef. of kurtosis  =  3.370931 
Filliben correlation coefficient  =  0.9989951 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Speaker), family = NO(mu.link = "identity"),      data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,  
        trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.626506   0.196112 202.061  < 2e-16 ***
SexM         0.082709   0.267917   0.309    0.758    
Age         -0.053300   0.006135  -8.688  < 2e-16 ***
SexM:Age     0.050696   0.008348   6.073 1.53e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.57734    0.01598   36.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1959 
Degrees of Freedom for the fit:  161.7831
      Residual Deg. of Freedom:  1797.217 
                      at cycle:  3 
 
Global Deviance:     7821.4 
            AIC:     8144.966 
            SBC:     9047.746 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 7821.4 8145
Age + Sex 7821.4 8143
Sex 7821.4 8141

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males) – interestingly, in this case the main effect of Sex disappears –, adding it does not seem to be warranted…

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001952693 
                       variance   =  1.000561 
               coef. of skewness  =  -0.03448375 
               coef. of kurtosis  =  2.780376 
Filliben correlation coefficient  =  0.9990344 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Speaker), sigma.formula = ~1 +      Sex * Age + random(Text) + random(Language) + random(Speaker),  
    family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),      i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.593055   0.157762 250.966  < 2e-16 ***
SexM         0.179628   0.233445   0.769    0.442    
Age         -0.052410   0.004872 -10.757  < 2e-16 ***
SexM:Age     0.047581   0.007310   6.509  9.9e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.4322689  0.0804947   5.370 8.94e-08 ***
SexM        0.0240788  0.1084567   0.222    0.824    
Age         0.0007877  0.0025235   0.312    0.755    
SexM:Age    0.0029496  0.0033829   0.872    0.383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1959 
Degrees of Freedom for the fit:  242.0359
      Residual Deg. of Freedom:  1716.964 
                      at cycle:  5 
 
Global Deviance:     7559.015 
            AIC:     8043.087 
            SBC:     9393.693 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 7559 8043.1
Age + Sex 7559.5 8039.4
Sex 7558.6 8036.1

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males) – interestingly, in this case the main effect of Sex disappears –, adding it does not seem to be warranted…

The distribution of the residuals is less heteroscedastic than before and the fit to the data better.

Summary

Thus, while for IR the hint that Age has a negative main effect and interacts with Sex (with a positive effect for males, containing the whole effect of Sex) is much stronger, the various fit indices do not warrant its inclusion in the GAMLSS models.

Modelling the relationship between SR and ID

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001627695 
                       variance   =  1.000508 
               coef. of skewness  =  -0.04476961 
               coef. of kurtosis  =  2.716779 
Filliben correlation coefficient  =  0.99893 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID + Sex * Age + random(Text) +      random(Speaker), sigma.formula = ~1 + ID + Sex *      Age + random(Text) + random(Speaker), family = NO(mu.link = "identity"),  
    data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 12.6730928  0.0541143  234.191   <2e-16 ***
ID          -0.9841747  0.0069997 -140.602   <2e-16 ***
SexM        -0.0687297  0.0401500   -1.712   0.0871 .  
Age         -0.0101486  0.0008593  -11.811   <2e-16 ***
SexM:Age     0.0109265  0.0012842    8.509   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.7738535  0.1464447  -5.284 1.42e-07 ***
ID          -0.0931526  0.0192597  -4.837 1.44e-06 ***
SexM        -0.0109575  0.1092433  -0.100    0.920    
Age          0.0001475  0.0025388   0.058    0.954    
SexM:Age     0.0041300  0.0034099   1.211    0.226    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1959 
Degrees of Freedom for the fit:  237.3435
      Residual Deg. of Freedom:  1721.656 
                      at cycle:  9 
 
Global Deviance:     606.1764 
            AIC:     1080.863 
            SBC:     2405.285 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 606.2 1080.9
Age + Sex 606.4 1077.2
Sex 606.4 1073.5

Clearly, adding Age is not warranted here…

As above, we also looked a the simple lmer model:

The compared models are:

Model AIC
Age * Sex 1700.1
Age + Sex 1691.9
Sex 1682.2

Interestingly, here the best model (as suggested by AIC) is the one with both Age and Sex (but no interaaction), but, as can be seen below, actually the effect of Age is not significant (p = 0.482), suggesting that, in fact, the best model is still the one not including Age:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ 1 + ID + Sex + Age + (1 | Text) + (1 | Speaker)
   Data: info.rate.data.for.age

REML criterion at convergence: 1677.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9435 -0.6442  0.0340  0.5995  3.0185 

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.50570  0.7111  
 Text     (Intercept) 0.01462  0.1209  
 Residual             0.10011  0.3164  
Number of obs: 1959, groups:  Speaker, 132; Text, 15

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  12.458038   0.526996 128.797113  23.640   <2e-16 ***
ID           -0.976300   0.075151 127.920530 -12.991   <2e-16 ***
SexM          0.271787   0.124633 127.912972   2.181    0.031 *  
Age          -0.004619   0.006545 127.895657  -0.706    0.482    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr) ID     SexM  
ID   -0.910              
SexM -0.114 -0.005       
Age  -0.521  0.167 -0.005

Conclusions about Age

Within the limits of our reduced dataset (containing only 132 speakers with Age info), we found the following:

When modelling SR and IR with GAMLSS, while there are hints that Age has, for both, overall:

  • a negative main effect (i.e., older speakers of both sexes are slower and transmit less information per unit time across texts and languages), and
  • an interaction with Sex (positive for males, i.e., males are less negatively affected by age that females across texts and languages),

it does not seem warranted to include it in these models.

When modelling the relationship between SR and ID, this negative relationship:

  • seems strengthened with increasing Age (i.e., older speakers of both sexes shows a stronger negative relationship between SR and ID), and
  • there is an interaction with Sex (for females, this effect of Age is stronger than for males),

but, alas, the inclusion of Age is not warranted in the GAMLSS model, nor (really) in the simpler LMER model.

Thus, while Age seems to negatively influence (in a sex-dependent manner) both SR and IR, as well as strengthen the negative relationship between them, its effects are far from clear in the current dataset.

Mixture of Gaussians

In what follows, mixing probabilities are independent from factors such as Sex.

SR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.612  
Sigma Coefficients for model: 1 
(Intercept)  
     0.1248  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2263 
Global Deviance:     6993.16 
            AIC:     6997.16 
            SBC:     7008.61 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      5.293  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.4553  
Mu Coefficients for model: 2 
(Intercept)  
      7.157  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.2298  

Estimated probabilities: 0.2925825 0.7074175 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2260 
Global Deviance:     6861.89 
            AIC:     6871.89 
            SBC:     6900.52 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
       5.23  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.4888  
Mu Coefficients for model: 2 
(Intercept)  
      6.612  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.1508  
Mu Coefficients for model: 3 
(Intercept)  
      7.271  
Sigma Coefficients for model: 3 
(Intercept)  
     -0.263  

Estimated probabilities: 0.2537915 0.2147315 0.531477 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2257 
Global Deviance:     6862.04 
            AIC:     6878.04 
            SBC:     6923.84 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.171  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.2973  
Mu Coefficients for model: 2 
(Intercept)  
      5.092  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.5485  
Mu Coefficients for model: 3 
(Intercept)  
      5.806  
Sigma Coefficients for model: 3 
(Intercept)  
     -0.454  
Mu Coefficients for model: 4 
(Intercept)  
      7.316  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.2168  

Estimated probabilities: 0.4443139 0.1849648 0.1507718 0.2199495 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2254 
Global Deviance:     6861.1 
            AIC:     6883.1 
            SBC:     6946.08 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.375  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.3485  
Mu Coefficients for model: 2 
(Intercept)  
      7.228  
Sigma Coefficients for model: 2 
(Intercept)  
     -1.001  
Mu Coefficients for model: 3 
(Intercept)  
      7.969  
Sigma Coefficients for model: 3 
(Intercept)  
    -0.5569  
Mu Coefficients for model: 4 
(Intercept)  
      6.248  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.4579  
Mu Coefficients for model: 5 
(Intercept)  
      5.041  
Sigma Coefficients for model: 5 
(Intercept)  
    -0.6049  

Estimated probabilities: 0.1788327 0.2320332 0.1993053 0.2001806 0.1896483 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2251 
Global Deviance:     6844.22 
            AIC:     6872.22 
            SBC:     6952.37 

Comparing AIC

            df      AIC
mix.SR.NO.2  5 6871.892
mix.SR.NO.5 14 6872.218
mix.SR.NO.3  8 6878.040
mix.SR.NO.4 11 6883.103
mix.SR.NO.1  2 6997.163

Showing the distributions

Mixture of Gaussians for SR.

Mixture of Gaussians for SR.

IR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.03  
Sigma Coefficients for model: 1 
(Intercept)  
      1.598  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2263 
Global Deviance:     13667 
            AIC:     13671 
            SBC:     13682.5 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.76  
Sigma Coefficients for model: 1 
(Intercept)  
      1.766  
Mu Coefficients for model: 2 
(Intercept)  
      38.36  
Sigma Coefficients for model: 2 
(Intercept)  
      1.336  

Estimated probabilities: 0.4795038 0.5204962 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2260 
Global Deviance:     13637.8 
            AIC:     13647.8 
            SBC:     13676.4 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      35.42  
Sigma Coefficients for model: 1 
(Intercept)  
      1.339  
Mu Coefficients for model: 2 
(Intercept)  
      39.78  
Sigma Coefficients for model: 2 
(Intercept)  
       0.89  
Mu Coefficients for model: 3 
(Intercept)  
      42.22  
Sigma Coefficients for model: 3 
(Intercept)  
      1.636  

Estimated probabilities: 0.3631272 0.2946715 0.3422013 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2257 
Global Deviance:     13618.4 
            AIC:     13634.4 
            SBC:     13680.2 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      42.31  
Sigma Coefficients for model: 1 
(Intercept)  
      1.699  
Mu Coefficients for model: 2 
(Intercept)  
      39.43  
Sigma Coefficients for model: 2 
(Intercept)  
     0.6302  
Mu Coefficients for model: 3 
(Intercept)  
      40.07  
Sigma Coefficients for model: 3 
(Intercept)  
      1.316  
Mu Coefficients for model: 4 
(Intercept)  
      34.48  
Sigma Coefficients for model: 4 
(Intercept)  
       1.25  

Estimated probabilities: 0.2568272 0.1726828 0.3017301 0.2687599 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2254 
Global Deviance:     13613.6 
            AIC:     13635.6 
            SBC:     13698.6 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      42.85  
Sigma Coefficients for model: 1 
(Intercept)  
      1.694  
Mu Coefficients for model: 2 
(Intercept)  
      37.38  
Sigma Coefficients for model: 2 
(Intercept)  
      1.457  
Mu Coefficients for model: 3 
(Intercept)  
      39.58  
Sigma Coefficients for model: 3 
(Intercept)  
     0.4193  
Mu Coefficients for model: 4 
(Intercept)  
      40.57  
Sigma Coefficients for model: 4 
(Intercept)  
      1.308  
Mu Coefficients for model: 5 
(Intercept)  
      35.24  
Sigma Coefficients for model: 5 
(Intercept)  
      1.337  

Estimated probabilities: 0.2128878 0.2184185 0.1195341 0.2222382 0.2269214 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2251 
Global Deviance:     13612.4 
            AIC:     13640.4 
            SBC:     13720.5 

Comparing AIC

            df      AIC
mix.IR.NO.3  8 13634.37
mix.IR.NO.4 11 13635.64
mix.IR.NO.5 14 13640.36
mix.IR.NO.2  5 13647.78
mix.IR.NO.1  2 13671.04

Showing the distributions

Mixture of Gaussians for IR.

Mixture of Gaussians for IR.

Tests of unimodality

We used three ways to estimate how unimodal a distribution is, as they tend to disagree and the problem of unimodality testing is far from settled (see Freeman & Dale, 2013):

  • the Silverman test tests the null hypothesis that an underlying density has at most k modes. The null hypothesis is that the underlying density has at most k modes (H0: number of modes <= k). The result is the p-value (bootstrapped) of rejecting a unimodal distribution. It is described in Silverman (1981) and Hall & York (2001); our implementation is based on the code available at https://www.mathematik.uni-marburg.de/~stochastik/R_packages/;
  • the dip test which computes Hartigans’ dip statistic D_n_, and its p-value for the test for unimodality, by interpolating tabulated quantiles of sqrt(n) • D_n_. For X_i_F, i.i.d., the null hypothesis is that F is a unimodal distribution. The result is the D metric and the p-value (interpolated) of rejecting a unimodal distribution. See Hartigan (1985) and Hartigan & Hartigan (1985); it is implemented in package diptest;
  • the bimodality coefficient (BC) which is based on an empirical relationship between bimodality and the third and fourth statistical moments of a distribution (skewness and kurtosis). It is proportional to the division of squared skewness with uncorrected kurtosis, BC ≈ (s2 + 1)/k, with the underlying logic that a bimodal distribution will have very low kurtosis, an asymmetric character, or both; all of these conditions increase BC. The values range from 0 and 1, with those exceeding .555 (the value representing a uniform distribution) suggesting bi-modality. The result is the DC estimate (which must exceed 0.555 to reject a unimodal distribution). We implemented it following Freeman & Dale (2013) as BC = (s2 + 1)/(k + 3 • ((n-1)2 / ((n-2) • (n-3)))).

For each such test, we performed four randomisation procedures to obtain an estimate of the “specialness” of the observed unimodality estimate; for each new permuted dataset, we recompute everything before estimating the unimodlaity of the permuted distribution:

  • Permutation model 1 (PM1): randomly permute the SR values freely among speakers, texts and languages;
  • Permutation model 2 (PM2): randomly permute the ID values among languages;
  • Permutation model 3 (PM3): randomly permute the Speaker average SR values among speakers (irrespective of languages);
  • Permutation model 4 (PM4): randomly permute the Language average SR values among languages.

Visual comparison

The observed estimate (vertical blue solid line), the permuted distribution (gray histogram), and the “unimodality region” (shaded green rectangle) are shown below (for PM3, we also show the original estimate using the Speaker average SR as a vertical solid red line).

PM1

**Permutation of the texts' SRs (PM1)**.

Permutation of the texts’ SRs (PM1).

PM2

**Permutation of the languages' ID (PM2)**.

Permutation of the languages’ ID (PM2).

PM3

**Permutation of the speakers' average SRs (PM3)**.

Permutation of the speakers’ average SRs (PM3).

PM4

**Permutation of the languages' average SRs with speaker adjustement (PM4)**.

Permutation of the languages’ average SRs with speaker adjustement (PM4).

Summary of unimodality tests

Summary of unimodality permutation tests.
Scenario Measure Test Observed estimate (p-value) % more unimodal permutations
PM1 SR Silverman - (0.019) 74.2%
PM1 SR Dip 0.006 (0.924) * 100%
PM1 SR BC 0.193 () * 100%
PM1 IR Silverman - (0.17) * 86.8%
PM1 IR Dip 0.005 (0.994) * 50%
PM1 IR BC 0.165 () * 100%
PM2 SR Silverman - (0.019) 75.2%
PM2 SR Dip 0.006 (0.924) * 100%
PM2 SR BC 0.193 () * 100%
PM2 IR Silverman - (0.17) * 22.6%
PM2 IR Dip 0.005 (0.994) * 14.8%
PM2 IR BC 0.165 () * 97.7%
PM3 SR Silverman - (0.019) 100%
PM3 SR Dip 0.006 (0.924) * 0%
PM3 SR BC 0.193 () * 100%
PM3 IR Silverman - (0.17) * 86.2%
PM3 IR Dip 0.005 (0.994) * 6%
PM3 IR BC 0.165 () * 100%
PM4 SR Silverman - (0.019) 5.2%
PM4 SR Dip 0.006 (0.924) * 3.1%
PM4 SR BC 0.193 () * 75.5%
PM4 IR Silverman - (0.17) * 11.1%
PM4 IR Dip 0.005 (0.994) * 5.9%
PM4 IR BC 0.165 () * 97.3%

Pair-wise distances between languages

We compute various distances between languages (as implemented by function distance() in package philentropy) in what concerns the distribution of NS, SR and ID.

Comparing the distribution of pairwise distances between languages.

Comparing the distribution of pairwise distances between languages.

Paired permutation t-tests comparing measures with 1,000 permutations.
m1 m2 d mean1 median1 sd1 mean2 median2 sd2 p
IR NS Hellinger 0.89 0.84 0.32 1.20 1.19 0.39 0.00
IR NS Jensen-Shannon 0.18 0.14 0.12 0.29 0.26 0.16 0.00
IR NS Kolmogorov–Smirnov 0.42 0.37 0.20 0.57 0.57 0.23 0.00
IR NS Kullback-Leibler 7.44 4.77 7.88 15.43 13.12 13.30 0.00
IR NS Squared-Chi 0.57 0.46 0.36 0.88 0.80 0.47 0.00
IR SR Hellinger 0.89 0.84 0.32 1.13 1.09 0.49 0.00
IR SR Jensen-Shannon 0.18 0.14 0.12 0.28 0.24 0.20 0.00
IR SR Kolmogorov–Smirnov 0.42 0.37 0.20 0.57 0.56 0.27 0.00
IR SR Kullback-Leibler 7.44 4.77 7.88 15.90 10.08 16.29 0.00
IR SR Squared-Chi 0.57 0.46 0.36 0.88 0.78 0.58 0.00
NS SR Hellinger 1.20 1.19 0.39 1.13 1.09 0.49 0.09
NS SR Jensen-Shannon 0.29 0.26 0.16 0.28 0.24 0.20 0.68
NS SR Kolmogorov–Smirnov 0.57 0.57 0.23 0.57 0.56 0.27 0.99
NS SR Kullback-Leibler 15.43 13.12 13.30 15.90 10.08 16.29 0.73
NS SR Squared-Chi 0.88 0.80 0.47 0.88 0.78 0.58 0.94

References

Campione, E., & Véronis, J. (1998). A multilingual prosodic database, Proc. of the 5th International Conference on Spoken Language Pro cessing (ICSLP’98), Sydney, Australia, 3163-3166.

Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1), 83-97.

Hall, P., & York, M. (2001). On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11, 515-536.

Hartigan, J. A., & Hartigan, P. M. (1985) The Dip Test of Unimodality. Annals of Statistics 13, 70–84.

Hartigan, P. M. (1985) Computation of the Dip Statistic to Test for Unimodality. Applied Statistics (JRSS C) 34, 320–325.

Le, V. B., Tran, D. D., Castelli, E., Besacier, L., & Serignat, J. F. (2004). Spoken and Written Language Resources for Vietnamese. In LREC. 4, pp. 599-602.

Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H., Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9). Association for Computational Linguistics, Gothenburg, Sweden, 36-43.

New B., Pallier C., Ferrand L., & Matos R. (2001). Une base de données lexicales du français contemporain sur internet: LEXIQUE 3.80, L’Année Psychologique, 101, 447-462. http://www.lexique.org.

Oh, Y. M. (2015). Linguistic complexity and information: quantitative approaches. PhD Thesis, Université de Lyon, France. Retrieved from http://www.afcp-parole.org/doc/theses/these_YMO15.pdf

Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610-615.

Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (Eds.) WaCky! Working papers on the web as corpus, Gedit, Bologna, http://corpus.leeds.ac.uk/queryzh.html.

Silverman, B.W. (1981). Using Kernel Density Estimates to investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43, 97-99.

Váradi, T. (2002). The Hungarian National Corpus. In LREC.

Zséder, A., Recski, G., Varga, D., & Kornai, A. (2012). Rapid creation of large-scale corpora and frequency dictionaries. In Proceedings to LREC 2012.

Appendix I: R session info

This document was compiled on:

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: grid, parallel, splines, stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: ggpubr(v.0.2), magrittr(v.1.5), ggrepel(v.0.8.0), RColorBrewer(v.1.1-2), dplyr(v.0.8.0.1), ggridges(v.0.5.1), broman(v.0.68-2), philentropy(v.0.3.0), diptest(v.0.75-7), pander(v.0.6.3), moments(v.0.14), sjPlot(v.2.6.2), sjstats(v.0.17.3), gamlss.mx(v.4.3-5), nnet(v.7.3-12), gamlss(v.5.1-2), nlme(v.3.1-137), gamlss.dist(v.5.1-1), MASS(v.7.3-51.1), gamlss.data(v.5.1-0), lmerTest(v.3.1-0), lme4(v.1.1-20), Matrix(v.1.2-15), plyr(v.1.8.4), reshape2(v.1.4.3), ggplot2(v.3.1.0) and RhpcBLASctl(v.0.18-205)

loaded via a namespace (and not attached): numDeriv(v.2016.8-1), tools(v.3.4.4), TMB(v.1.7.15), backports(v.1.1.3), R6(v.2.4.0), sjlabelled(v.1.0.16), lazyeval(v.0.2.1), colorspace(v.1.4-0), withr(v.2.1.2), tidyselect(v.0.2.5), mnormt(v.1.5-5), emmeans(v.1.3.2), compiler(v.3.4.4), sandwich(v.2.5-0), labeling(v.0.3), scales(v.1.0.0), mvtnorm(v.1.0-8), psych(v.1.8.12), stringr(v.1.4.0), digest(v.0.6.18), foreign(v.0.8-71), minqa(v.1.2.4), rmarkdown(v.1.11), stringdist(v.0.9.5.1), pkgconfig(v.2.0.2), htmltools(v.0.3.6), highr(v.0.7), pwr(v.1.2-2), rlang(v.0.3.1), generics(v.0.0.2), zoo(v.1.8-4), modeltools(v.0.2-22), bayesplot(v.1.6.0), Rcpp(v.1.0.0), munsell(v.0.5.0), prediction(v.0.3.6.2), stringi(v.1.3.1), multcomp(v.1.4-8), yaml(v.2.2.0), snakecase(v.0.9.2), sjmisc(v.2.7.7), forcats(v.0.4.0), crayon(v.1.3.4), lattice(v.0.20-38), ggeffects(v.0.8.0), haven(v.2.1.0), hms(v.0.4.2), knitr(v.1.21), pillar(v.1.3.1), estimability(v.1.3), codetools(v.0.2-16), stats4(v.3.4.4), glue(v.1.3.0), evaluate(v.0.13), data.table(v.1.12.0), modelr(v.0.1.4), nloptr(v.1.2.1), gtable(v.0.2.0), purrr(v.0.3.0), tidyr(v.0.8.2), assertthat(v.0.2.0), xfun(v.0.5), coin(v.1.2-2), xtable(v.1.8-3), broom(v.0.5.1), coda(v.0.19-2), survival(v.2.43-3), tibble(v.2.0.1), glmmTMB(v.0.2.3) and TH.data(v.1.0-10)

Appendix II: Figures for the main paper

Here we generate the figures used in the main paper (saved to the ./figures folder as 600 DPI TIFF files Figure-*.tiff).